Abstract

We introduce a parametric form of pooling, based on a 
Gaussian, which can be optimized alongside the features 
in a single global objective function. By contrast, existing 
pooling schemes are based on heuristics (e.g. local max- 
imum) and have no clear link to the cost function of the 
model. Furthermore, the variables of the Gaussian explic- 
itly store location information, distinct from the appearance 
captured by the features, thus providing a what/where de- 
composition of the input signal. Although the differentiable 
pooling scheme can be incorporated in a wide range of hi- 
erarchical models, we demonstrate it in the context of a De- 
convolutional Network model (Zeiler et al. [22]). We also 
explore a number of secondary issues within this model and 
present detailed experiments on MNIST digits. 

1. Introduction 

A number of recent approaches in vision and machine 
learning have explored hierarchical representations for im- 
ages and video, with the goal of learning features for object 
recognition. One class of methods, for example Convolutional
Neural Networks [ ] or the recent RICA model of Le et al. [ ],
uses a purely feed-forward hierarchy that maps the input image
to a set of features which are presented to a simple classifier.
Another class of models attempts to build
hierarchical generative models of the data. These include 
Deep Belief Networks [ ], Deep Boltzmann Machines [ ] 
and the Compositional Models of Zhu et al. [23, 4]. 

Spatial pooling is a key mechanism in all these hierar- 
chical image representations, giving invariance to local per- 
turbations of the input and allowing higher-level features to 
model large portions of the image. Sum and max pooling 
are the most common forms, with max being typically pre- 
ferred (see Boureau et al. [ ] for an analysis). 

In this paper we introduce a parametric form of pool- 
ing that can be directly integrated into the overall objective 
function of many hierarchical models. Using a Gaussian 
parametric model, we can directly optimize the mean and 
variance of each Gaussian pooling region during inference 
to minimize a global objective function. This contrasts with 



existing pooling methods that just optimize a local crite- 
rion (e.g. max over a region). Adjusting the variance of 
each Gaussian allows a smooth transition between select- 
ing a single element (akin to max pooling) over the pooling 
region, or averaging over it (like a sum operation). 

Integrating pooling into the objective facilitates joint 
training and inference across all layers of the hierarchy, 
something that is often a major issue in many deep models. 
During training, most approaches build up layer-by-layer, 
holding the output of the layer beneath fixed. However, this 
is sub-optimal, since the features in the lower layers cannot
use top-down information from a higher layer to improve 
them. A few approaches do perform full joint training of 
the layers, notably the Deep Boltzmann Machine [ ], and 
Eslami et al. [ ], as applied to images, and the Deep Energy 
Models of Ngiam et al. [1. ]. We demonstrate our differ- 
entiable pooling in a third model with this capability, the 
Deconvolutional Networks of Zeiler et al. [ ]. This is a 
simple sparse-coding model that can be easily stacked and 
we show how joint inference and training of all layers is 
possible, using the differentiable pooling. However, differ- 
entiable pooling is not confined to the Deconvolutional Net- 
work model - it is capable of being incorporated into many 
existing hierarchical models. 

The latent variables that control the Gaussians in our 
pooling scheme store location information ("where"), dis- 
tinct from the features that capture appearance ("what"). 
This separation of what/where is also present in Ran- 
zato et al. [ ], the transforming auto-encoders of Hinton 
et al. [ ], and Zeiler et al. [ ]. 

In this paper, we also explore a number of secondary is- 
sues that help with training deep models: non-negativity 
constraints; different forms of sparsity; overcoming local 
minima during inference and different sparsity levels dur- 
ing training and testing. 

2. Model Overview 

We explain our contributions in the context of a Deconvolutional
Network, introduced by Zeiler et al. [22]. This model is a
hierarchical form of convolutional sparse coding that can learn
invariant image features in an unsupervised manner. Its
simplicity allows the easy integration of differentiable pooling
and is amenable to joint inference over all layers.

Figure 1. (a): A 2-layer model architecture, (b): Schematic of
inference in a two layer model, (c): Illustration of the Gaussian
parameterization used in our differentiable pooling.

Let us start by reviewing a single Deconvolutional Network
layer, presented with an input image v (having c color
channels). The goal is to produce a reconstruction $\hat{v}$ from
sparse features p that is close to v. We achieve this by
minimizing:

$$ C(v) = \frac{\lambda}{2} \|\hat{v} - v\|_2^2 + |p|_\alpha \tag{1} $$

where $\lambda$ is a hyper-parameter that controls the influence
of the reconstruction term and p consists of a set of B 2-D
feature maps, thus forming an overcomplete basis. To give a
unique solution, a sparsity constraint on p is needed and we use
an element-wise pseudo-norm $|p|_\alpha$, where
$0.5 \le \alpha \le 1$. The reconstruction $\hat{v}$ is produced
from p by two sub-layers: Unpooling and Convolution.

2.1. Unpooling 

In the unpooling sub-stage, each 2D feature map $p_b$ undergoes
an unpooling operation to produce a larger 2D unpooled feature
map $z_b$ (3D (un)pooling is also possible, as explored in [22]).
Each element j in $p_b$ influences a small neighborhood $N_j$
(typically 2 x 2 or 3 x 3) in the unpooled map $z_b$, via a set
of weights w(i) within the neighborhood:

$$ z_b(i) = w(i)\, p_b(j) \quad \forall i \in N_j \tag{2} $$

We constrain the weights w(i) to have unit $\ell_2$-norm, as
this makes the unpooling operation invertible. The inverse
pooling operation computes each element j in $p_b$ as the sum,
weighted by w(i), over the neighborhood $N_j$ of the unpooled map
$z_b$:

$$ p_b(j) = \sum_{i \in N_j} w(i)\, z_b(i) \tag{3} $$

Combining Eqs. 2 and 3, we have
$p_b(j) = \sum_{i \in N_j} w^2(i)\, p_b(j)$, hence the need for
the unit $\ell_2$-norm constraint.

In Zeiler et al. [ ], max (un)pooling was used, equivalent
to w(i) being all zero, except for a single element set to 1.
In this work, we consider more general w(i)'s, as detailed
in Section 2.5, treating them as latent variables which will
be inferred for each input image. Note that each element in
p has its own set of w's.

For the rest of the paper, we consider the neighborhoods
$N_j$ to be non-overlapping, but the above formulation
generalizes to overlapping regions as well. For brevity, we write
the unpooling operation as a single linear matrix $U_w$,
parameterized by weights w: $z = U_w p$.

2.2. Convolution

In the convolution sub-stage, the reconstruction $\hat{v}$ is
formed by convolving the 2D unpooled feature maps $z_b$ with
filters $f_b^c$ and summing them:

$$ \hat{v}_c = \sum_{b=1}^{B} f_b^c * z_b \tag{4} $$

where * is the 2D convolution operator. The filters f are the
parameters of the model common to all images. The feature
maps z are latent variables, specific to each image. For
notational brevity, we combine the convolution and summing
operations into a single convolution matrix F and convert
the multiple 2D maps $z_b$ into a single vector z:
$\hat{v} = Fz$.
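
As an illustration of the unpooling and pooling operations of Eqs. 2 and 3, the following NumPy sketch unpools and then re-pools a feature map with non-overlapping k x k neighborhoods. The function names, the array layout and the random weights are assumptions made for this example, not the authors' implementation. With unit $\ell_2$-norm weights, pooling the unpooled map recovers the original features.

```python
import numpy as np

def unpool(p, w, k=2):
    """Eq. 2: spread each p[j] over its k x k neighborhood using weights w.

    p: (H, W) pooled feature map.
    w: (H, W, k, k) weights, one k x k set per element of p.
    Returns the (H*k, W*k) unpooled map z.
    """
    H, W = p.shape
    z = w * p[:, :, None, None]                    # z(i) = w(i) p(j), i in N_j
    return z.transpose(0, 2, 1, 3).reshape(H * k, W * k)

def pool(z, w, k=2):
    """Eq. 3: p(j) = sum over N_j of w(i) z(i)."""
    H, W = z.shape[0] // k, z.shape[1] // k
    blocks = z.reshape(H, k, W, k).transpose(0, 2, 1, 3)   # (H, W, k, k)
    return (w * blocks).sum(axis=(2, 3))

rng = np.random.default_rng(0)
p = rng.random((3, 4))
w = rng.random((3, 4, 2, 2))
# Normalize each neighborhood's weights to unit l2 norm.
w /= np.sqrt((w ** 2).sum(axis=(2, 3), keepdims=True))
p_back = pool(unpool(p, w), w)                     # equals p up to rounding
```

Max (un)pooling corresponds to the special case where each k x k weight set is all zero except for a single 1.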

2.3. Discussion of Single Layer 

The combination of unpooling and convolution operations gives
the reconstruction $\hat{v} = F U_w p$ from p, so that the
objective of Eqn. 1 becomes:

$$ C(v) = \frac{\lambda}{2} \|F U_w p - v\|_2^2 + |p|_\alpha \tag{5} $$

A single layer of the model is shown in the lower part of
Fig. 1(a). This integrated formulation allows the
straightforward optimization of the filters f, features p and the
(un)pooling weights w to minimize a single objective function.
While most other models also learn filters and features,
the pooling operation is typically fixed. Direct optimization
of Eqn. 5 with respect to w is one of the main contributions of
this work and is described in Section 2.5.

Note that, given fixed weights w, the reconstruction $\hat{v}$ is
linear in p, thus Eqn. 5 describes a tri-linear model, with w
coding position (where) information about the (what) features p.

Eqn. 5 differs from the original Deconvolutional Network
formulation [ ] in several important ways. First,
sparsity is imposed directly on p, as opposed to z. This
integrates pooling into the objective function, allowing it to
become part of the inference. Second, [ ] considers only
$\alpha = 1$, rather than the hyper-Laplacian ($\alpha < 1$)
sparsity we employ. Third, p is non-negative, as opposed to [22]
where there was no such constraint. Fourth, and most importantly,
by inferring the optimal (un)pooling weights w we directly
minimize the objective function of the model. Fixed sum
or max pooling, employed by other approaches, is a local
heuristic that has no clear relationship to the overall cost.

2.4. Multiple Layers

Multi-layer models are constructed by stacking the single
layer model described above in the same manner as
Zeiler et al. [ ]. The feature maps p from one layer become
the input maps to the layer above (which now has B
"color channels").

An important property of the model is that feature maps
exist solely at the top of the model (there are no explicit
features in intermediate layers), thus the only variables at
the intermediate layers are filters F and unpooling weights
w. For an l-layer model, the reconstruction $\hat{v}$ is:

$$ \hat{v} = F_1 U_{w_1} F_2 U_{w_2} \cdots F_l U_{w_l} p_l = R_l p_l \tag{6} $$

where $F_k$ and $U_{w_k}$ are the convolution and unpooling
operations from each layer k. We condense the sequence of
unpooling and convolution operations into a single
reconstruction operator $R_l$, which lets us write the overall
objective for a multi-layer model (shown here for a single image,
but optimized over a set of images $v^1, \ldots, v^N$ during
training):

$$ C_l(v) = \frac{\lambda_l}{2} \|R_l p_l - v\|_2^2 + |p_l|_\alpha \tag{7} $$



A multi-layer model is shown in Fig. 1(a). Note that since
$R_l$ is linear, given the (un)pooling weights w, the
reconstruction term is easily differentiable. The derivative of
$R_l$ is simply $R_l^T = U_{w_l}^T F_l^T \cdots U_{w_1}^T F_1^T$,
which is a forward propagation operator. This takes a signal at
the input and repeatedly convolves (using flipped versions of the
filters at each layer) and pools (using the weights w of each
layer) all the way up to the features. This is a key operation
for both inference and learning, as described in Section 3 and
Section 4 respectively. Fig. 1(b) illustrates the reconstruction
and forward propagation operations.
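
The relationship between the reconstruction operator and forward propagation can be checked numerically. The sketch below is an illustration only: it replaces the convolution and unpooling operators of a two-layer model with small random dense matrices (hypothetical stand-ins with arbitrary sizes) and verifies that forward propagation is the transpose (adjoint) of reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dense stand-ins for the per-layer operators (sizes are arbitrary).
F1, U1 = rng.random((20, 30)), rng.random((30, 12))
F2, U2 = rng.random((12, 16)), rng.random((16, 5))

def reconstruct(p):
    """Top-down pass: R p = F1 U1 F2 U2 p (two-layer case of Eqn. 6)."""
    return F1 @ (U1 @ (F2 @ (U2 @ p)))

def forward_prop(e):
    """Bottom-up pass: R^T e = U2^T F2^T U1^T F1^T e."""
    return U2.T @ (F2.T @ (U1.T @ (F1.T @ e)))

p = rng.random(5)     # top-layer features
e = rng.random(20)    # a signal in pixel space
lhs = reconstruct(p) @ e      # <R p, e>
rhs = p @ forward_prop(e)     # <p, R^T e>
```

The two inner products agree up to rounding, which is the defining property of the transpose operator.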

2.5. Differentiable Pooling 

We impose a parametric form on the (un)pooling weights 
w to ensure that the features are invariant to small changes 
in the input. The pooling would otherwise be able to memo- 
rize perfectly the unpooled features, giving "lossless" pool- 
ing which would not generalize at all. 

The parametric model we use is a 2D axis-aligned Gaussian,
with mean $(\mu_x, \mu_y)$ and precision $(\gamma_x, \gamma_y)$
over the pooling neighborhood $N_j$, introduced in Section 2.1.
The Gaussian is normalized within the extent of the pooling
region to give weights w whose squares sum to 1 (thus giving
unit $\ell_2$ norm):

$$ w(i) = \frac{a(i)}{\sqrt{\sum_{i' \in N_j} a(i')^2}} \tag{8} $$

where a(i) is the value of the Gaussian for element i, at
location x(i), y(i) within the neighborhood $N_j$:



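$$ a(i) = \exp\left(-\tfrac{1}{2}\left[\gamma_x (x(i) - \mu_x)^2 + \gamma_y (y(i) - \mu_y)^2\right]\right) \tag{9} $$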



Fig. 1(c) shows an illustration of this parameterization. For
brevity, we let $\theta_j = \{\mu_x, \mu_y, \gamma_x, \gamma_y\}$
be the parameters for neighborhood $N_j$. We thus rewrite the
unpooling operation $U_{w_l}$ as $U_{\theta_l}$. The Gaussian
representation has several advantages over existing sum or max
pooling:

• Varying the mean of the Gaussian selects a particular 
region in the unpooled feature map, just like max pool- 
ing. This makes the feature invariant to small transla- 
tions within the unpooled maps. 

• Varying the precision of the Gaussian allows a smooth 
variation between max and sum operations (high and 
low precision respectively). 

• Changes in precision allow invariance to small scale 
changes in the unpooled features. For example, the 
width of an edge can easily be altered by adjusting the 
variance (see Fig. 2(c)). 



• The continuous nature of the Gaussian allows sub- 
pixel reconstruction that avoids aliasing artifacts, 
which can occur with max pooling. See Fig. 5 for an 
illustration of this. 

• The Gaussian representation is differentiable, i.e. the
gradient of Eqn. 5 with respect to $\theta_j$ has analytic form,
as detailed in Section 3.2.
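
The transition between max-like and sum-like behavior can be seen in a small numerical sketch. It assumes a Gaussian of the form exp(-0.5 * precision * (coordinate - mean)^2) on integer coordinates 0..k-1 within a neighborhood, normalized to unit $\ell_2$ norm; the function name and the coordinate convention are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def gaussian_pool_weights(mu_x, mu_y, gamma_x, gamma_y, k=3):
    """Unit-l2-norm Gaussian weights over a k x k pooling neighborhood.

    Assumes the exponent -0.5 * gamma * (coordinate - mu)**2 and integer
    pixel coordinates 0..k-1 inside the neighborhood.
    """
    y, x = np.mgrid[0:k, 0:k]                      # row (y) and column (x) grids
    a = np.exp(-0.5 * (gamma_x * (x - mu_x) ** 2 + gamma_y * (y - mu_y) ** 2))
    return a / np.sqrt((a ** 2).sum())             # normalize so sum(w**2) == 1

# High precision: nearly all weight on the element at the mean (max-like).
w_sharp = gaussian_pool_weights(2.0, 0.0, 50.0, 50.0)
# Low precision: nearly uniform weights over the region (sum-like).
w_flat = gaussian_pool_weights(1.0, 1.0, 1e-6, 1e-6)
```

For the 3 x 3 region used here, the near-uniform case gives weights close to 1/3 each, since nine equal weights with unit $\ell_2$ norm must each equal 1/3.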

2.6. Non-Negativity 

In standard sparse coding and other learning methods 
both the feature activations and the learned parameters can 
be positive or negative. This contrasts with our model, in 
which we enforce non-negativity. 

This is motivated by several factors. First, there is no
notion of negative intensities or objects in the visual world.
Second, the Gaussian parameterization used in the differentiable
pooling scheme, described in Section 2.5, has positive
weights, so cannot represent individual negative values in
the unpooled feature maps. Third, there is some biological
evidence for non-negative representations within the brain 
[10]. Finally, we find experimentally that non-negativity re- 
duces the flexibility of the model, encouraging it to learn 
good representations. The features computed at test-time 
have improved classification performance, compared with 
models without this constraint (see Section 6.4). 

2.7. Hyper-Laplacian Sparsity 

Most sparse coding models utilize the $\ell_1$-norm to enforce
a sparsity constraint on the features [ ], as a proxy for
optimizing $\ell_0$ sparsity [ ]. However, a drawback of this
form of regularization is that it gives the same cost to two
elements being 0.5 versus a single element at 1 and the other
at 0, even though the latter has a lower $\ell_0$ cost.

To encourage features with lower $\ell_0$ cost, we use a
pseudo-norm $\ell_{0.5}$ (i.e. $\alpha = 0.5$ in Eqn. 5) inspired
by Krishnan and Fergus [ ], which aggressively pushes small
elements toward zero. To optimize this, we experimented
with techniques in [ ], but settled on gradient descent for
simplicity.
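
The two-element comparison above can be worked through numerically. The sketch assumes the element-wise pseudo-norm is the sum of |p_i| raised to the power alpha; the helper name is hypothetical.

```python
import numpy as np

def pseudo_norm(p, alpha):
    """Element-wise cost sum_i |p_i|**alpha (alpha = 1 gives the l1 norm)."""
    return float(np.sum(np.abs(p) ** alpha))

spread = np.array([0.5, 0.5])   # two active elements
single = np.array([1.0, 0.0])   # one active element

# alpha = 1: both configurations cost 1.0, so l1 cannot tell them apart.
l1_spread, l1_single = pseudo_norm(spread, 1.0), pseudo_norm(single, 1.0)
# alpha = 0.5: the spread configuration costs 2 * sqrt(0.5), about 1.414,
# while the single active element still costs 1.0.
l05_spread, l05_single = pseudo_norm(spread, 0.5), pseudo_norm(single, 0.5)
```

Under alpha = 0.5 the configuration with one active element is therefore cheaper, matching its lower $\ell_0$ cost.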

3. Inference 

During inference, the filters f at all layers are fixed and
the objective is to find the features p and (un)pooling
variables $\theta$ for all neighborhoods and all layers that
minimize Eqn. 7. We do this by alternating between updating the
features p and the Gaussian variables $\theta$, while holding the
other fixed.

3.1. Feature Updates 

For a given layer l we seek the features $p_l$ that minimize
$C_l(v)$ (Eqn. 7), given an input image v, filters
$f_1, \ldots, f_l$ and unpooling variables
$\theta_1, \ldots, \theta_l$. This is a large convolutional
sparse coding problem and we adapt the ISTA scheme of
Beck and Teboulle [ ]. This uses an iterative framework of
gradient and shrinkage steps.

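Gradient step: The gradient of the reconstruction term of
$C_l(v)$ with respect to $p_l$ (up to the factor $\lambda_l$,
which is applied in Eqn. 11) is:

$$ \nabla p_l = \frac{\partial C_l(v)}{\partial p_l} = R_l^T (R_l p_l - v) \tag{10} $$

This involves first reconstructing the input from the current
features: $\hat{v} = R_l p_l$, computing the error signal
$e = (\hat{v} - v)$, and then forward propagating this up to
compute the top layer gradient $\nabla p_l = R_l^T e$. Given the
gradient, we then can update $p_l$:

$$ p_l = p_l - \lambda_l \beta_{p_l} \nabla p_l \tag{11} $$

where the parameter $\beta_{p_l}$ sets the size of the gradient
step.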

Shrinkage step: Following the gradient step, we perform a
per-element shrinkage operation that clamps small
elements in $p_l$ to zero, increasing its sparsity. For
$\alpha = 1$, we use the standard $\ell_1$ shrinkage:

$$ p_l = \max(|p_l| - \beta_{p_l}, 0) \cdot \mathrm{sign}(p_l) \tag{12} $$



For $\alpha = 0.5$, we step in the direction of the gradient:

$$ p_l = p_l - \beta_{p_l} \frac{1}{2\sqrt{|p_l|}} \cdot \mathrm{sign}(p_l) \tag{13} $$



Projection step: After shrinking small elements away,
the solution is then projected onto the non-negative set:

$$ p_l = \max(p_l, 0) \tag{14} $$

Step size calculation: In order to set a learning rate 
for the feature map optimization, we employ an estimation 
technique for steepest descent problems [ lO] which uses the 
gradients $\nabla p_l$:

$$ \beta_{p_l} = \frac{\nabla p_l^T \nabla p_l}{\nabla p_l^T R_l^T R_l \nabla p_l} \tag{15} $$

Automating the step-size computation has two advantages.
First, each layer requires a significantly different learning 
rate on account of the differences in architecture, making it 
hard to set manually. Second, by computing the step-size 
before each gradient step, each ISTA iteration makes good 
progress at reducing the overall cost. In practice, we find 
fixed step-sizes to be significantly inferior. 

$\nabla p_l$ is computed once per mini-batch. For efficiency,
instead of computing the denominator in Eqn. 15 for each 
image, we estimate it by selecting a small portion (~10%) 
of each mini-batch. 
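
The gradient, shrinkage and projection steps, together with the automatic step size, can be summarized in a short sketch. This is a minimal illustration for the $\alpha = 1$ case only. It assumes a dense matrix R in place of the reconstruction operator, and a step size equal to the squared gradient norm divided by the squared norm of the reconstructed gradient; the function name and the toy values are hypothetical.

```python
import numpy as np

def ista_step(p, v, R, lam):
    """One feature update: gradient step, l1 shrinkage, non-negative projection."""
    e = R @ p - v                        # reconstruction error
    g = R.T @ e                          # gradient, forward-propagated error (Eqn. 10)
    Rg = R @ g
    beta = (g @ g) / (Rg @ Rg)           # step size (assumed form of Eqn. 15; needs g != 0)
    p = p - lam * beta * g               # gradient step (Eqn. 11)
    p = np.maximum(np.abs(p) - beta, 0.0) * np.sign(p)   # shrinkage (Eqn. 12)
    return np.maximum(p, 0.0)            # projection onto the non-negative set (Eqn. 14)

R = 2.0 * np.eye(2)                      # hypothetical toy reconstruction operator
v = np.array([2.0, 0.2])                 # toy input
p_new = ista_step(np.zeros(2), v, R, lam=1.0)
```

In this toy case the step size is 0.25, the gradient step moves the features from zero to [1.0, 0.1], and shrinkage by 0.25 leaves [0.75, 0.0]: the small element is clamped to zero while the large one is kept.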

Reset step: Repeated optimization of the objective function
tends to get stuck in local minima as it proceeds over
the dataset for several epochs. We found a simple and
effective way to overcome this problem. By setting all feature
maps $p_l$ to 0 every few epochs (essentially re-initializing
inference), cleaner filters and better performing features can
be learned, as demonstrated in Section 6.5.

This reset may be explained as follows. During alternat- 
ing inference and learning stages, the model can overfit a 



mini-batch of data by optimizing either the filters or feature 
maps too much. This causes the model to lock up in a state 
where no new feature map element can turn on because the 
reconstruction performance is sufficient to have only a small 
error propagating forward to the feature level. Since no new 
features turn on after shrinkage, the filters remain fixed as 
they continue to get the same gradients. This can happen 
early in the learning procedure when the filters are still not 
optimal and therefore the learned representation suffers. By 
resetting the feature maps, at the next epoch the model has 
to reevaluate how to reconstruct the image from scratch, and 
can therefore turn on the optimal feature elements and con- 
tinue to optimize the filters. 

3.2. (Un)pooling Variable Updates 

Given a model with l layers, we wish to update the
(un)pooling variables $\theta_k$ at each intermediate layer k to
optimize the objective $C_l(v)$. We assume that the filters
$f_1, \ldots, f_l$ and features $p_l$ are fixed.

The gradients for the pooling variables $\theta_k$ involve
combining, at layer k, the forward propagated error signal with
the top down reconstruction signal. This combined signal
then drives the update of the pooling variables. More
formally:

$$ \frac{\partial C_l(v)}{\partial U_{\theta_k}} = R_k^T (R_l p_l - v) \cdot (R_{(l \to k)} p_l) \tag{16} $$

where $R_{(l \to k)}$ is the top down reconstruction from layer l
feature maps to layer k feature maps and $R_k^T$ is the error
propagation up to $z_k$.

With the chosen Gaussian parameterization of the pooling
regions, the chain rule can be used to compute the gradient
for each parameter
$\theta_k = \{\mu_x, \mu_y, \gamma_x, \gamma_y\}$:

$$ \nabla \theta_k = \frac{\partial C_l(v)}{\partial \theta_k(j)} = \sum_{i' \in N_j} \frac{\partial C_l(v)}{\partial U_{\theta_k}(i')} \frac{\partial U_{\theta_k}(i')}{\partial w(i')} \sum_{i \in N_j} \frac{\partial w(i')}{\partial a(i)} \frac{\partial a(i)}{\partial \theta_k(j)} \tag{17} $$

where j is the neighborhood index and

$$ \frac{\partial U_{\theta_k}(i')}{\partial w(i')} = p_k(i') = (R_{(l \to k)} p_l)(i') $$
Algorithm 1 Learning with Differentiable Pooling in
Deconvolutional Networks

Require: Training set Y of N images v^i, # layers L, # epochs E, # ISTA steps T
Require: Regularization coefficients λ_l, # feature maps B_l
Require: Pooling step sizes β_{U_l}
for l = 1 : L do                          %% Loop over layers
  Init. features/filters: p_l^i = 0, f_l ~ N(0, ε)
  Init. switches: θ_l^i = Fit(R_l^T v^i) ∀i
  for epoch = 1 : E do                    %% Epoch iteration
    for i = 1 : N do                      %% Loop over images
      for t = 1 : T do                    %% ISTA iteration
        Reconstruct input: v̂^i = R_l p_l^i
        Compute reconstruction error: e = v̂^i - v^i
        Propagate error up to layer l: ∇p_l = R_l^T e
        Estimate step size β_{p_l} as in Eqn. 15
        Take gradient step on p: p_l^i = p_l^i - λ_l β_{p_l} ∇p_l
        Perform shrink: p_l^i = max(|p_l^i| - β_{p_l}, 0) sign(p_l^i)
        Project to positive: p_l^i = max(p_l^i, 0)
        for k = 1 : l do                  %% Loop over lower layers
          Take gradient step on θ: θ_k^i = θ_k^i - λ_l β_{U_k} ∇θ_k^i
        end for
      end for
    end for
    Update f_l by solving Eqn. 26 using CG
    Project f_l to positive and unit length
  end for
end for
Output: filters f, feature maps p and pooling variables θ.



Once the complete gradient is computed as in Eqn. 17,
we do a gradient step on each pooling variable:

$$ \theta_k = \theta_k - \lambda_l \beta_{U_k} \nabla \theta_k \tag{25} $$

using a fixed step size $\beta_{U_k}$. We experimented with a
similar step size calculation to Eqn. 15 for the pooling
parameters, however found the estimates to be unstable, likely
due to the nonlinear derivatives involved in the Gaussian
pooling.
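
For reference, the inner factors of the chain rule can be written out under explicit assumptions. The derivation below is a sketch, not the paper's own equations: it assumes the weights are the Gaussian values divided by their $\ell_2$ norm over the neighborhood, and that the Gaussian has exponent $-\tfrac{1}{2}\gamma(\cdot)^2$ in each axis. If the Gaussian is parameterized differently, the constants change accordingly.

```latex
% Assumed forms:
%   w(i) = a(i) / \|a\|_2,  with  \|a\|_2 = \sqrt{\sum_{i' \in N_j} a(i')^2}
%   a(i) = \exp(-\tfrac{1}{2}[\gamma_x (x(i)-\mu_x)^2 + \gamma_y (y(i)-\mu_y)^2])
\begin{align*}
\frac{\partial w(i')}{\partial a(i)}
  &= \frac{\delta_{ii'}}{\|a\|_2} - \frac{a(i')\,a(i)}{\|a\|_2^3}, \\
\frac{\partial a(i)}{\partial \mu_x}
  &= \gamma_x \,\bigl(x(i) - \mu_x\bigr)\, a(i), \\
\frac{\partial a(i)}{\partial \gamma_x}
  &= -\tfrac{1}{2}\,\bigl(x(i) - \mu_x\bigr)^2\, a(i),
\end{align*}
% with the analogous expressions for \mu_y and \gamma_y.
```

Here $\delta_{ii'}$ is 1 when $i = i'$ and 0 otherwise. The second term of the first line arises because changing any a(i) also changes the normalizer shared by all weights in the neighborhood.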

4. Learning 

After inference of the feature maps for the top layer and
(un)pooling variables for all layers is complete, the filters in
each layer are updated. This is done using the gradient with
respect to each layer's filters:

$$ \frac{\partial C_l(v)}{\partial f_{k,b}^c} = \lambda_l \left[ R_{k-1}^T (R_l p_l - v) \right]_c * \left[ U_{\theta_k} R_{(l \to k)} p_l \right]_b \tag{26} $$

where the left term is the bottom up error signal propagated
up to the feature maps below the given filters, $p_{k-1}$, and
the right term is the top down reconstruction to the unpooled
feature maps $z_k$. The gradient is therefore the convolution
between all combinations of input error maps to the layer
(indexed by c) and the unpooled feature maps reconstructed
from above (indexed by b), resulting in updates of each filter
plane for each layer k.

In practice we use batch conjugate gradient updates for
learning the filters as the model is linear in $F_k$ once the
feature maps and pooling parameters are inferred. After 2 steps
of conjugate gradients, the filters are projected to be
non-negative and renormalized to unit $\ell_2$ length.
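
The projection applied to the filters after the conjugate gradient steps can be sketched as follows. This is an illustration under assumptions: the text does not say whether the normalization is applied per filter plane or per group of planes, so the sketch treats a single array as one filter, and the helper name and guard value are hypothetical.

```python
import numpy as np

def project_filter(f, eps=1e-12):
    """Clip a filter to be non-negative, then rescale it to unit l2 length."""
    f = np.maximum(f, 0.0)                       # projection onto the non-negative set
    return f / max(np.linalg.norm(f), eps)       # eps guards against an all-zero filter

f = np.array([[3.0, -1.0], [0.0, 4.0]])
f_proj = project_filter(f)                       # [[0.6, 0.0], [0.0, 0.8]]
```

Clipping removes the negative entry, leaving values 3 and 4 with norm 5, so the projected filter has entries 0.6 and 0.8.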

4.1. Joint Inference 

The objective function explicitly constrains the recon- 
struction from the top layer features to be close to the input 
image. From this we can calculate gradients for each layer's 
filters and pooling variables while optimizing the top level 
feature maps. Therefore for each image we can infer the
local shifts and scalings of low level features as the high 
level concepts develop. 

We have found that pre-training the first layer in one
phase of training and then using the pooling variables and
learned layer 1 filters to initialize a second phase of training
works best. The second phase of training optimizes
the second layer objective from which we can update $p_2$,
$U_{w_2}$, $U_{w_1}$, $F_2$, and $F_1$ jointly. If care is not
taken in this joint update, the first layer features can trade
off representation power with the second layer filters. This can
result in the second layer filters capturing the details while
the first layer filters become dots. To avoid this problem, after
the first phase of training we hold $F_1$ fixed and optimize the
remaining variables jointly. Thus, while the filters are learned
layer-by-layer, inference is always performed jointly across
all layers. This has the nice property that these low level
parts can move and scale as the $U_{w_1}$ variables are optimized
while the high level concepts are learned.

5. Initialization of Parameters 

Before training, the filter parameters are initialized to 
Gaussian distributed random values. After this random ini- 
tialization, the filters are projected to be non-negative and 
normalized to unit length before training begins. 

Before inference, either at the start of training or at test
time, we initialize the feature maps to 0. This creates a
reconstruction of 0 in the pixel space, therefore the initial
gradient being propagated up the network is $-v$. This is
similar to a feedforward network for the first iteration of
inference. While forward propagating this signal up the network
we can leverage the Gaussian parameterization of the
pooling regions to fit these pooling parameters using
moment matching. That is, at each layer, we extract the optimal
pooling parameters that fit this bottom up signal. This
provides a natural initialization to both the pooling variables
at each layer and the top level feature activations given the
input image and the filter initialization.
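
One possible reading of the moment-matching initialization is sketched below. The paper does not spell out the procedure, so everything here is an assumption: the non-negative signal in a pooling neighborhood is treated as an unnormalized distribution over pixel coordinates, its mean gives the Gaussian mean, and the inverse of its variance gives the precision. The function name and the small constant eps are hypothetical.

```python
import numpy as np

def fit_gaussian_by_moments(patch, eps=1e-6):
    """Fit mean and precision of an axis-aligned Gaussian to a non-negative patch.

    Returns (mu_x, mu_y, gamma_x, gamma_y) using integer coordinates
    0..k-1 inside the patch. eps avoids division by zero for empty or
    single-pixel patches.
    """
    k_y, k_x = patch.shape
    y, x = np.mgrid[0:k_y, 0:k_x]
    m = patch / max(patch.sum(), eps)            # normalize to a distribution
    mu_x, mu_y = (m * x).sum(), (m * y).sum()    # first moments
    var_x = (m * (x - mu_x) ** 2).sum()          # second central moments
    var_y = (m * (y - mu_y) ** 2).sum()
    return mu_x, mu_y, 1.0 / (var_x + eps), 1.0 / (var_y + eps)

# A uniform 2 x 2 patch: mean at the center, variance 0.25 per axis.
mu_x, mu_y, gamma_x, gamma_y = fit_gaussian_by_moments(np.ones((2, 2)))
```

For the uniform patch the fitted mean is (0.5, 0.5) and the precision is close to 4 in each axis, while a patch with all its mass on one pixel would give a very large precision, which corresponds to max-like pooling.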

6. Experiments 

Evaluation on MNIST: We choose to evaluate our model
on the MNIST handwritten digit classification task. This 
dataset provides a relatively large number of training in- 
stances per class, has many other results to compare to, and 



allows easy interpretation of how a trained model is decom- 
posing each image. 

Pre-processing: The inputs were the unprocessed 
MNIST digits at 28x28 resolution. Since no preprocessing 
was done, the elements remained nonnegative. 

Model architecture: We trained a 2 layer model with 
5x5 filters in each layer and 2x2 non-overlapping pooling 
regions. The first layer contained 16 feature maps and the 
second layer contained 48 feature maps. Each of these 48
feature maps connects randomly to 8 different layer 1 feature
maps through the second layer filters. These sizes were cho- 
sen comparable to [ ] while being more amenable to GPU 
processing. The receptive fields of the second layer features 
are 14x14 pixels with this configuration, or one quarter the 
input image size. 

Classification: One motivation of this paper was to analyze
how the classification pipeline of Zeiler et al. [22]
could be simplified by making the top level features of the
network more informative. Therefore, in this paper we simply
treat the top level activations inferred for each image as
input to a linear SVM [ ].

The only post processing done to these high level activa- 
tions is that overlapping patches are extracted and pooled, 
analogous to the dense SIFT processing which is shown by 
many computer vision researchers to improve results [ ]. 
This step provides an expansion in the number of inputs,
allowing the linear SVM to operate in a higher dimensional
space. For layer 1 classification these patches were 9x9
elements of the layer 1 feature maps. For layer 2 they were
6x6 patches, roughly the same ratio to the feature map size
as for layer 1. These patches were concatenated as input to
the classifier. Throughout the experiments we did not combine
features from multiple layers, concatenating only layer 1
patches together for layer 1 classification and only layer 2
features together for layer 2 classification. These final
inputs to the classifier were each normalized to unit length.

Hyperparameters: By cross validating on a 50,000
train and 10,000 validation set of MNIST images, we found
that $\lambda_1 = 2$ and $\lambda_2 = 0.5$ gave optimal
classification performance. Each layer was trained with 100 ISTA
steps/epoch for 50 epochs (passes through the dataset). After
epoch 25, the feature maps were reset to 0 during training. At
test time, we found higher $\lambda_1 = 5$ and $\lambda_2 = 5$
improved classification, as did optimizing for only 50 ISTA steps
of inference.

6.1. Model visualization 

By visualizing the filters and feature maps of the model,
we can easily understand what it has learned. In Fig. 2(a)
we demonstrate sharp reconstructions of the input images
from the second layer feature maps. In Fig. 2(b)
we display the raw filter coefficients for layer 1 which have
learned small pieces of strokes. By incorporating the pooling
parameters into the layer, these filters are robust to small
changes in the input.

Figure 2. Visualization of the trained model: (a) reconstructions
from layer 2, (b) the 16 layer 1 filter weights, (c) invariance
visualization for layer 1 incorporating unpooling and convolutions
(see Section 6.1 for details), (d) layer 2 filter weights (shown
as 16 groups of filter planes connecting to all 48 layer 2 maps),
(e) layer 2 pixel space invariance visualization of features
projected down from samples of the layer 2 feature distribution
(see Section 6.1).

Visualizing these invariances of a model can be help- 
ful in understanding the inputs the model is sensitive to. 
Searching through the dataset of inferred feature map ac- 
tivations and selecting the maximum element per feature 
map to project downward into the pixel space as in [ ] 
is one way of visualizing these invariances. However, these 
selected elements are only exemplars of inputs that most 
strongly activated that feature. In Fig. 2(c) we show a more 
representative selection of invariances by instead selecting 
a feature activation to be projected down based on sampling 
from the distribution of activations for that feature in the
dataset. This gives a less biased view of what activates 
that feature than selecting the largest few activations from 
the dataset. Once a sample is selected for a given feature 
map, the pooling variables corresponding to the image from 
which the activation was selected are used in the unpooling 
stages to do the top down visualization. 

Examining the 16 sample visualizations for each feature 
in Fig. 2(c) shows the scale and shifts that the Gaussian 
pooling provides to these relatively simple first layer filters. 
We can continue to analyze the model by viewing the layer 
2 filter planes in Fig. 2(d). Each of the 48 second layer
features has 16 filter planes (shown in separate groups), one 
connecting to each of the layer 1 feature maps. While the 
second layer filters are difficult to understand directly, we 
can visualize the learned representation of the second layer 
by projecting down all the way to the pixel space through 
layer 1. Fig. 2(e) shows for each of the 48 feature maps 
a 4x4 grid of pixel space projections obtained by sampling 
16 activations from the distribution of activations of each 
layer 2 feature and projecting down via alternating convo- 
lution and unpooling with the corresponding pooling vari- 
ables separately for each activation. 

While analyzing the features in pixel space is informa- 
tive, we have also found it is useful to view the features as 
decompositions of an input image to know how the model 
is representing the data. One possible method of displaying 
the decomposition is by coloring each pixel of the recon- 
struction according to which feature it came from. Each 
feature is assigned a hue (in no particular order) and the as- 
sociated reconstruction produced then defines the saturation 
of that color. The resulting image therefore depicts the high 
level feature assignments. Pixels with brownish colors indi- 
cate a summation of several colors (features) together. Note 
that the input images themselves are grayscale - the colors 
are just for visualization purposes. 

In Fig. 3(d) we show such a reconstruction from layer
1 for the original image in (e). To understand the model
we also show the layer 1 feature map activations in (a) with
their corresponding color assignment around them. Notice
the sparse distribution of activations can reconstruct the
entire image by utilizing the Gaussian pooling and layer 1
filters in (c). Fig. 3(b) shows the result of this unpooling
operation on the feature maps. Notice in the orange and purple
boxes the elongated lines in the unpooled maps, made possible
by a low precision in one dimension.

Figure 3. One layer decomposition of a digit into parts. From the
top down the layer 1 feature maps (a) are unpooled into (b) and
convolved with (e) to produce the reconstruction (f). The colors
in the reconstruction simply represent which feature the
reconstructed pixel came from.

Figure 4. Two layer decomposition of a digit into parts. From the
top down the layer 2 feature maps (a) are unpooled into (b) and
convolved with the layer 2 filters to produce the reconstruction of
layer 1 feature maps (c). These are unpooled into (d) and
convolved with (e) to produce (f), colored according to the layer 2
feature from which each pixel was reconstructed.

Fig. 4 takes this analysis one step further by using the
second layer of the model. Starting from 3 features in the
layer 2 feature maps, shown in (a), the activations are unpooled
(as shown in (b)) and then convolved with the second layer
filters to reconstruct many elements down onto the first
layer feature maps (c). These are further unpooled to (d),
where the benefit of the Gaussian pooling is again visible: it
transitions smoothly between non-overlapping pooling regions.
These are finally convolved with the first layer filters (e)
to give the decomposition shown in (f). Notice how long range
structures are grouped into common features in the higher
layer compared to the layer 1 decomposition of Fig. 3.

6.2. Max Pooling vs Gaussian Pooling 

The discrete locations that max pooling allows within a
region are a limiting factor in the reconstruction quality of
the model. Fig. 5 (bottom) shows that a significant aliasing
effect is present in the visualizations of the model when max
pooling is used. With the complex interactions between
positive and negative elements removed, the model is not
able to form smooth transitions between non-overlapping
pooling regions, even though the filters used in the succeeding
convolution sublayer overlap between regions. Using
the Gaussian pooling, the model can infer the desired
precisions and means in order to optimize the reconstruction
quality from high layers of the model.
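
The contrast between the two unpooling operations can be illustrated with a minimal NumPy sketch for a single pooling region. The function names, the axis-aligned (diagonal precision) Gaussian, and the normalization of the weights to sum to one are assumptions made for illustration and may differ from the formulation used in the model.

```python
import numpy as np

def max_unpool(value, index, size):
    """Place `value` at one integer (row, col) location of a size x size region."""
    region = np.zeros((size, size))
    region[index] = value
    return region

def gaussian_unpool(value, mean, precision, size):
    """Spread `value` over a size x size region using Gaussian weights.

    mean:      (row, col) center, which may be fractional.
    precision: (row precision, col precision); low precision = wide spread.
    """
    rows, cols = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    exponent = (precision[0] * (rows - mean[0]) ** 2
                + precision[1] * (cols - mean[1]) ** 2)
    weights = np.exp(-0.5 * exponent)
    weights /= weights.sum()  # assumed normalization: weights sum to 1
    return value * weights

# A high row precision and a low column precision give an elongated
# horizontal stroke, something a single max location cannot express.
stroke = gaussian_unpool(1.0, mean=(2.0, 2.0), precision=(50.0, 0.01), size=5)
point = max_unpool(1.0, index=(2, 2), size=5)
```

Because the mean and precisions are continuous, they can be adjusted by gradient steps on the reconstruction cost, whereas the max location can only jump between grid positions.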

This fine tuning of reconstruction allows for improvements
without significantly varying the feature activations
(i.e. it maintains or decreases the sparsity while adjusting the
pooling parameters). This is confirmed in Fig. 6, where we
break down the cost function into the reconstruction and
regularization terms. In this figure we also display the ℓ0
sparsity of each model, as this can be used directly for
comparison.

The Gaussian pooling significantly outperforms Max
pooling in terms of optimizing the objective. Because it is not
able to adjust the pooling variables to optimize the overall
cost, Max pooling plateaus despite running for many
epochs. Additionally, it has a much higher ℓ0 cost throughout
training. In contrast, the ℓ0 cost with Gaussian pooling
decreases smoothly throughout training because the model
can fine tune the pooling parameters to explain much more
with each feature activation. Table 1 shows that this property
significantly improves classification performance
compared to Max pooling when stacking layers.





                    Layer 1    Layer 2
Max Pooling         1.30%      1.25%
Gaussian Pooling    1.38%      0.84%



Table 1. MNIST error rate of Max pooling versus Gaussian pooling
for 1 and 2 layer models. Note the performance improvement
when stacking layers with Gaussian pooling.

Figure 5. Feature decomposition comparison between Gaussian
pooling (top) and max pooling (bottom). Each reconstructed 
pixel's color corresponds to the layer 2 feature map it was recon- 
structed from. Note the reuse of similar strokes in digits of a dif- 
ferent class. Aliasing artifacts are present in the reconstructions 
using max pooling - see Section 6.2. 

Figure 6. Breakdown of cost function into reconstruction and
regularization terms for Max and Gaussian pooling for 2 layer models.
Gaussian pooling gives consistently lower cost than max pooling.
Furthermore, the ℓ0 sparsity (shown in blue) is significantly lower
for the Gaussian pooling, although not explicitly part of the cost.

6.3. Joint Inference 

One of the main criticisms of sparse coding methods is 
that inference must be conducted even at test time due to 
the lack of a feedforward connection to encode the features. 
In our approach we discovered two fundamental techniques 
that mitigate this drawback. 

The first is that running a joint inference procedure over 
both layers of our network improves the classification per- 
formance compared to running each layer separately. In- 
stead of inferring the feature maps and pooling variables 
for the first layer and then using these pooling variables to 
initialize the second layer inference (2 phases), we can di- 
rectly run inference with a two layer model. The differ- 
entiable pooling allows us to infer the pooling variables of 
both layers in addition to the layer 2 feature values simul- 
taneously in 1 phase. At the first iteration of inference we 
leverage the ability to fit the Gaussian pooling parameters in 
a feed forward way as mentioned in Section 5. This halves 
the number of inference iterations needed by not requiring 
any first layer inference prior to inferring the second layer. 

To examine this first discovery in depth we considered
several combinations of how to jointly train and then run
inference at test time with this model. During training we
have found, both qualitatively in terms of feature diversity
and quantitatively in terms of classification performance,
that training in separate phases, one for each layer of the
model, works better than jointly training both layers from
scratch. In the second phase of training, when optimizing
for reconstruction from the second layer feature maps, the
first layer pooling variables and filters can either be updated
or held fixed. Each row of Table 2 examines one combination
of these updates during training. We can see that the
optimal training scheme was with fixed first layer filters but
pooling updates on both layers. This made the system more
stable while still allowing the first layer filters to move
and scale as needed by updating the first layer pooling
variables.

In all cases we see a significant reduction in error rates
when doing inference in 1 phase. The middle column of the
table shows this 1 phase inference, but without optimizing
the first layer pooling parameters U1, whereas the last
column does optimize U1. We see an improvement in updating
U1 for all but the last row, which was trained without U1 and
so is used to that type of inference. This improvement with
joint inference of U1, U2, and p2 is a key finding which is
only possible with differentiable pooling.



Training              Infer 2 phases   Infer 1 phase (no U1)   Infer 1 phase
Updating F1, U1       1.79%            1.63%                   1.40%
Updating F1           1.71%            1.21%                   1.10%
Updating U1           1.39%            1.04%                   0.84%
No Layer 1 Updates    1.46%            0.99%                   1.03%

Table 2. Comparison of joint training techniques. Each row is a
trained two layer model that updates select variables in layer 1
during training (in addition to F2 and U2). The three columns
use these models but run inference at test time in 2 phases, 1 phase
without updating U1, and 1 phase with all updates respectively.

The second discovery that reduces evaluation time is that
running the same number of ISTA iterations as was done
during training does not give optimal classification
performance, possibly due to over-sparsification of the features.
Similarly, running with too few iterations also reduces
performance. Fig. 7 shows a plot comparing the number of
ISTA iterations to the classification performance, with an
optimum at 50 ISTA steps, half the number used during
training.

Figure 7. Comparison of classification errors versus number of
ISTA steps used during inference. 
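
To make the role of the iteration count concrete, the following is a minimal sketch of generic ISTA for an ℓ1-regularized least squares problem with a dense dictionary. It is not the convolutional inference procedure of the model: the matrix `A`, the weight `lam`, and the function names are hypothetical stand-ins, and `n_steps` is the quantity varied in Fig. 7.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink every entry of x toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_steps):
    """Run n_steps of ISTA on 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    # Step size 1 / L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        grad = A.T @ (A @ x - y)                      # gradient step
        x = soft_threshold(x - step * grad, step * lam)  # shrinkage step
    return x

# The number of steps at test time is a free choice and need not
# match the number used during training.
codes = ista(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0, n_steps=50)
```

Each additional step costs one forward and one backward pass through the dictionary, so fewer steps at test time translate directly into less evaluation time.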



6.4. Effects of Non-Negativity 

With negative elements present in the system, many pos- 
sible solutions can be found during optimization. This hap- 
pens because subtractions allow the removal of portions of 
high level features. This has the effect of making them less 
discriminative because the model can change parameters in- 
between the high level feature activations and the input im- 
age in order to reconstruct better while assigning less mean- 
ing to the feature activations themselves. 

To show this is not an artifact of the Gaussian pooling
being more suited to non-negative systems (due to the
summation over the pooling region possibly leading to
cancellations if negatives are present), we include a comparison
to Max pooling in Table 3. In both cases, enforcing positivity
via projected gradient descent improves the discriminative
information preserved in the features.





                    Positive/Negative    Non-negative
Max Pooling         2.04%                1.25%
Gaussian Pooling    2.32%                0.84%



Table 3. MNIST error rate for Max and Gaussian models trained 
with and without the non-negativity constraint. 
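
Enforcing positivity via projected gradient descent amounts to clipping negative values after each gradient step. The sketch below shows this on a plain least squares problem; the dense matrix `A` and the function name are hypothetical, and the actual model applies the projection within its own inference and learning updates.

```python
import numpy as np

def projected_gradient_nonneg(A, y, n_steps):
    """Minimize 0.5 * ||A x - y||^2 subject to x >= 0."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        grad = A.T @ (A @ x - y)
        # Gradient step followed by projection onto the non-negative orthant.
        x = np.maximum(x - step * grad, 0.0)
    return x

# The negative target component cannot be matched by a subtraction,
# so the corresponding variable stays at zero.
x_hat = projected_gradient_nonneg(np.eye(2), np.array([1.0, -2.0]), n_steps=20)
```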

6.5. Effects of Feature Reset 

When training the model on MNIST, some less than optimal
filters are learned when not resetting the feature maps.
For example, in Fig. 8 (c) many of the layer 1 filters are
block-like, such as the one in the 3rd row, 2nd column. However, this
same feature in (a) improves if the feature maps are reset
once halfway through training. This single reset is
enough to encourage the filters to specialize and improve.
Similarly, the layer 2 pixel visualizations in (b) have much
more variation due to the reset compared to (d), which did
not have the reset. In particular, notice the many blob-like
features learned in (d) without reset, such as those in the 2nd and 5th
rows of the 1st column, that improve in (b). These larger,
more varied features learned with the reset help improve
classification performance, as shown in Table 4.



Trained with No Reset    Trained with Reset
1.00%                    0.84%



Table 4. MNIST error rates for 2 layer models trained with and 
without resetting the feature maps. 



6.6. Effects of Hyper-Laplacian Sparsity 

It has previously been shown that sparsity encourages
learning of distinctive features; however, it is not necessarily
useful for classification [18] [22]. We analyze this in the
context of hyper-Laplacian sparsity applied to both training
and inference. In this comparison we trained two models,
one with an ℓ1 prior on the feature maps and the other with
an ℓ0.5 prior. Once trained, we took each model and ran
inference with both ℓ1 and ℓ0.5 priors. For reference, the
sparsity for the training runs was 4.2 for the ℓ0.5 regularized
training and 20.2 for the ℓ1 regularized training with
the same λ2 = 0.5 setting. Since the amount of sparsity can
also be controlled during inference by the λ2 parameter, we
plot in Fig. 9 the classification performance for various λ2
settings in these four model combinations.

Figure 8. Qualitative difference in first layer filters with (left) and
without (right) resetting of the feature maps.


Figure 9. Error rates for ℓ1 and ℓ0.5 priors used in training and
inference.

Interestingly, utilizing the added sparsity during training
enforced by the ℓ0.5 prior while using the more relaxed ℓ1
prior for inference is the optimal combination for all λ
settings. This suggests sparsity is useful during training to
learn meaningful features, but is not as useful for inference
at test time.
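
The difference between the two priors can be seen on a toy example. The sketch below assumes the penalty takes the form of a sum of |x|^p over the feature maps, with p = 1 for the ℓ1 prior and p = 0.5 for the hyper-Laplacian prior; the function names are hypothetical.

```python
import numpy as np

def lp_penalty(feature_maps, p):
    """Sum of |x|^p over all entries (p=1: l1 prior, p=0.5: hyper-Laplacian)."""
    return float(np.sum(np.abs(feature_maps) ** p))

def l0_count(feature_maps, tol=0.0):
    """Number of entries with magnitude above tol (the l0 sparsity)."""
    return int(np.sum(np.abs(feature_maps) > tol))

concentrated = np.array([1.0, 0.0])   # one strong activation
spread = np.array([0.5, 0.5])         # same l1 mass, two activations

# The l1 prior cannot tell these apart, whereas the l0.5 prior
# charges more for spreading the same mass over two activations.
l1_gap = lp_penalty(spread, 1.0) - lp_penalty(concentrated, 1.0)
l05_gap = lp_penalty(spread, 0.5) - lp_penalty(concentrated, 0.5)
```

Under this assumed form, the ℓ0.5 prior pushes harder toward few, strong activations, which is consistent with the lower sparsity count reported for the ℓ0.5 regularized training run.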

6.7. Comparison to Other Methods 

We chose the MNIST dataset for its large number of
results to compare to. Of these, deep learning methods
typically fall into one of two categories: 1) those that are
completely unsupervised and have a simple classifier on top, or
2) those that are fine-tuned discriminatively with labels. Our
method falls into the first category as it is completely
unsupervised during training, and only the linear SVM applied
on top has access to the label information of the training
set. We do not back propagate this information through the
network, but this would be an interesting future direction to
pursue. Table 5 shows our method is competitive with other
deep generative models, even surpassing several which use
discriminative fine tuning.

                         Pre-training   Fine-tuning
Our Method               0.84%          -
CDBN (1+2 layers) [ ]    0.82%
DBN (3 layers) [ ] [ ]   2.5%           1.18%
DBM (2 layers) [ ]       -              0.95%

Table 5. MNIST error rates for related generative models.

7. Discussion 

In this work we introduced the concept of differentiable
pooling for deep learning methods. We also demonstrated
that joint training of the model improves performance, that
positivity encourages the model to learn better representations,
and that there is an optimal amount of sparsity to be used
during training and inference. Finally, we introduced a simple
resetting scheme to avoid local minima and learn better
features. We believe many of the approaches and findings
in this work are applicable not only to Deconvolutional
Networks but also to sparse coding and other deep learning
methods in general.